A fast algorithm for genome-wide analysis of proteins with repeated sequences.

Authors

  • M Pellegrini
  • E M Marcotte
  • T O Yeates
Abstract

We present a fast algorithm to search for repeating fragments within protein sequences. The technique is based on an extension of the Smith-Waterman algorithm that allows the calculation of sub-optimal alignments of a sequence against itself. We are able to estimate the statistical significance of all sub-optimal alignment scores. We also rapidly determine the length of the repeating fragment and the number of times it is found in a sequence. The technique is applied to sequences in the Swissprot database, and to 16 complete genomes. We find that eukaryotic proteins contain more internal repeats than those of prokaryotic and archaeal organisms. The finding that 18% of yeast sequences and 28% of the known human sequences contain detectable repeats emphasizes the importance of internal duplication in protein evolution.
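To make the idea of aligning a sequence against itself concrete, here is a minimal sketch in Python. It is not the authors' implementation: it finds only the single best off-diagonal local alignment, and it omits the sub-optimal alignments and significance estimates the abstract describes. The function name `self_local_alignment` and the scoring parameters (`match`, `mismatch`, `gap`) are illustrative assumptions; a real protein search would normally use a substitution matrix instead of a flat match/mismatch score.

```python
def self_local_alignment(seq, match=2, mismatch=-1, gap=-2):
    """Best local alignment of seq against itself, skipping the trivial diagonal.

    Scoring values are assumptions for illustration only.
    Returns (score, span_a, span_b) with 0-based, half-open spans.
    """
    n = len(seq)
    # H[i][j]: best score of a local alignment ending at seq[i-1] and seq[j-1]
    H = [[0] * (n + 1) for _ in range(n + 1)]
    best_score, best_end = 0, (0, 0)

    for i in range(1, n + 1):
        # Fill only cells with j > i; the diagonal is the sequence matching itself.
        for j in range(i + 1, n + 1):
            s = match if seq[i - 1] == seq[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new local alignment
                H[i - 1][j - 1] + s,  # extend with a match or mismatch
                H[i - 1][j] + gap,    # gap in the second copy
                H[i][j - 1] + gap,    # gap in the first copy
            )
            if H[i][j] > best_score:
                best_score, best_end = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_end
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if seq[i - 1] == seq[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1

    return best_score, (i, best_end[0]), (j, best_end[1])


# Toy sequence in which the fragment "ACDEF" occurs twice.
score, span_a, span_b = self_local_alignment("ACDEFWWWACDEF")
print(score, span_a, span_b)  # 10 (0, 5) (8, 13)
```

Restricting the fill to cells with `j > i` is one simple way to exclude the trivial alignment of the sequence with itself. Recovering further repeat copies would require repeating the search while masking alignments already found, which this sketch does not attempt.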

Similar resources

I-49: Human Y Chromosome Proteome Project

The success of the Human Genome Project (HGP) has provided a blueprint for the approximately 20,000 gene-encoded proteins potentially active in all of the hundreds of cell types that make up the human body. Yet we still have limited knowledge about a majority of the gene-encoded proteins which are the “building blocks of life” and “cellular machinery”. It is estimated that for nearly half of th...

gpALIGNER: A Fast Algorithm for Global Pairwise Alignment of DNA Sequences

Bioinformatics, through the sequencing of the full genomes for many species, is increasingly relying on efficient global alignment tools exhibiting both high sensitivity and specificity. Many computational algorithms have been applied for solving the sequence alignment problem. Dynamic programming, statistical methods, approximation and heuristic algorithms are the most common methods appli...

Protein Databases

Proteins are sources of many peptides with diverse biological activity. Some of them are considered as valuable components of foods and drug targets with desired and designed biological activity. We are now entering an era rich in biological data in which the field of bioinformatics is poised to exploit this information in increasingly powerful ways. There are currently many databases all over ...

Evaluation of First and Second Markov Chains Sensitivity and Specificity as Statistical Approach for Prediction of Sequences of Genes in Virus Double Strand DNA Genomes

Growing amount of information on biological sequences has made application of statistical approaches necessary for modeling and estimation of their functions. In this paper, sensitivity and specificity of the first and second Markov chains for prediction of genes was evaluated using the complete double stranded DNA virus. There were two approaches for prediction of each Markov Model parameter,...

Profile of Eight Prophage Sequences Present in the Genomes of Different Acinetobacter baumannii Strains

Background and Objective: Prophage sequences are major contributors to interstrain variations within the same bacterial species. Acinetobacter baumannii is a gram-negative bacterium that causes a wide range of nosocomial infections, especially in intensive care unit inpatients. Prophage sequences constitute a considerable proporti...

Journal:
  • Proteins

Volume 35, Issue 4

Pages -

Publication date: 1999